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Abstract 

We have developed an alignment-free method that calculates phylogenetic distances us- 
ing a maximum likelihood approach for a model of sequence change on patterns that are dis- 
covered in unaligned sequences. To evaluate the phylogenetic accuracy of our method, and 
to conduct a comprehensive comparison of existing alignment-free methods (freely avail- 
able as Python package decaf +py at http : / /www . bioinf ormatics . org . au), we 
have created a dataset of reference trees covering a wide range of phylogenetic distances. 
Amino acid sequences were evolved along the trees and input to the tested methods; from 
their calculated distances we infered trees whose topologies we compared to the reference 
trees. 

We find our pattern-based method statistically superior to all other tested alignment- 
free methods on this dataset. We also demonstrate the general advantage of alignment-free 
methods over an approach based on automated alignments when sequences violate the as- 
sumption of collinearity. Similarly, we compare methods on empirical data from an existing 
alignment benchmark set that we used to derive reference distances and trees. Our pattern- 
based approach yields distances that show a linear relationship to reference distances over a 
substantially longer range than other alignment-free methods. The pattern-based approach 
outperforms alignment-free methods and its phylogenetic accuracy is statistically indistin- 
guishable from alignment-based distances. 

Key words: alignment-free methods, phylogenetics, distance estimation, pattern discovery 

1 Introduction 

Tasks like database searching, sequence classification, phylogenetic tree reconstruction and de- 
tection of regulatory sequences are ubiquitous in bioinformatics. Most methods performing these 
tasks are based on (automated) alignments; however, alignment-free methods exist for solving 
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the tasks. Recent years have seen an increasing number of alignment-free methods (reviewed 
in Vinga & Almeida, 2003, see also Snel et al., 2005; Vinga et al., 2004; Van Helden, 2004; 
Pham & Zuegg, 2004; Burstein et al., 2005; Wu et al., 2005). In contrast to methods based on 
(automated) alignments, alignment-free methods make fewer assumptions about the nature of 
the sequences they work on, and so far are mostly devoid of any evolutionary model of sequence 
change (the only exception being the W-metric by Vinga et al., 2004). The absence of an assump- 
tion of collinearity over long stretches (implicit in any alignment) destines them to be especially 
useful for handling DNA sequences that have undergone recombination, proteins with shuffled 
domains, and genomic sequences (which often feature large-scale rearrangements). 

Previously, several alignment-free methods have been compared systematically for classifi- 
cation purposes and their ability to detect regulatory sequences. However, surprisingly little is 
known about their accuracy in phylogeny reconstruction. So far, most new methods have been 
verified on a few trees only. A systematic study is sorely lacking in this field of research. 

In Section 2 below we describe several alignment-free methods that we included in our com- 
parison. We propose a new alignment-free method based on patterns in sequences, and a variant 
thereof, in Section 3. The phylogenetic accuracy of these methods is comparatively evaluated 
on synthetic and empirical data, covering a wide range of phylogenetic distances, and we assess 
whether differences are statistically significant in Section 4 before presenting conclusions. 

2 Previous work 

In this section, we provide a summary of previously established alignment-free methods. Some 
methods were reviewed in Vinga and Almeida (2003), while others are more recent and are 
compared here for the first time. 

We represent a biological sequence by a string X of n characters taken from the alphabet 
A which contains c different characters {a\, . . . ,a c }, e.g. all amino acids. Most alignment-free 
methods operate on words of length k, so-called fc-mers: there are w — c k such different words. 
We represent the set of fc-mers in X (or a derived property) by vector = (vf ,...,v^); the 
parameter k is always implied. Each vector element describes the abundance of fc-mer i. 

The (squared) Euclidean distance was introduced into sequence comparison by Blaisdell 
(1986). The distance between X and Y is calculated using cf, the count of fc-mer occurrences in 
X. 

d E (xj) = jr(cf-cj) 2 (i) 
(=i 

Later, Blaisdell (1989) found that d E yields values about twice the number of mismatch 
counts obtained from alignments. 

The standardized Euclidean distance was found to improve on d E without incurring the com- 
putational problems associated with the slightly better performing Mahalanobis distance (Wu et 
al., 1997). 
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d s {XJ) = ^{ff/sf-fJ/s]f (2) 
i=\ 

Divide ff, the relative frequencies of fc-mer occurrences in X, by their standard deviations 
sf as calculated from a set of equilibrium frequencies (Gentleman & Mullin, 1989). 

Edgar (2004a) described the fractional common k-mer count; it is used in a distance measure 
that speeds up guide tree construction in MUSCLE (Edgar, 2004b). Let Cf ¥ = min(cf ,cf) 
denote the common fc-mer count, and J 7 be a string with m characters. 

w 

F(X,Y) = £cf7[min(n,m) -k+ 1] (3) 

(=l 

d F (XJ) = -\og(e+F) (4) 

F, the fraction of common fc-mers between X and Y, ranges from to 1 and d F transforms 
this into a distance: e, a small value added to prevent taking the logarithm of zero (at least in 
Edgar, 2004a), is 0.1 there but 0.02 in MUSCLE. Both versions employ d F in slightly different 
ways; here, we directly use this common basis with e = 0. 1. 

Van Helden (2004) compared several metrics for their suitability in classifying genes based on 
their regulatory sequences. He found a similarity measure based on probabilities from common 
fc-mer counts under a multiplicative Poisson model to be best-performing. In our adaptation, we 
directly use the probabilities from Equation 5 without transforming them into similarities. 



I 1 else 



" XY >o 

(5) 



Y[p{*>cf Y ) 

4=1 



1 1/w 

(6) 



d (X,Y) 

Here, in the calculation of Cf Y the occurrence counts of fc-mers are filtered to remove self- 
overlapping instances, thereby justifying the Poisson assumption. F p refers to the Poisson prob- 
ability distribution function and its parameter £, is the expected count under a set of equilibrium 
frequencies. P(x > Cf Y ) is the probability that we observe a fc-mer count at least as high as that 
between X and Y . 

The last word-based alignment-free method considered here is the composition distance of 
Hao and Qi (2004). Under a Markov model of order k — 2we predict the probability p° of a word 
(the Ci refer to its characters) from the probabilities of appropriate shorter subwords, respective 
their relative frequencies. To get the expected count E x of a fc-mer in X, we re-arrange the 
corresponding total numbers. 

0/ ^_P(ci,...,C k -i)p(c 2 ,...,C k ) 

p (Cl,...,C k ) = r (7) 

p(c 2 ,...,Cjfc_l) 
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x c x (ci,...,c k ^i)c x (c 2 ,...,c k ) (n-k+l)(n-k + 3) 
t {a,...,c k ) = y? \ 7 / , n\2 W 

We can now assemble the composition vector (Hao & Qi, 2004) for fc-mer occurrence counts 
c x in X: \ x = (c^ — E x ) fE x . Then we calculate the correlation between X and Y as the cosine 
of the angle between their composition vectors, and obtain a normalized dissimilarity d c . 

cos(X,7) = U = l 1 ' — (9) 

[Ltl(vf) 2 xir=l(vf) 2 ] 1/2 

^=1^x1 (10) 

The W-metric due to Vinga et al. (2004) is "word-based" but operates on 1-mers only: 

w w 

d w (x, Y) = £ £ (f x - fj ) ■ (ff - fj) ■ Wij (11) 

i=ij=i 

Differences in amino acid composition, ff — ff , between all pairs of amino acids, are 
weighted by their entries in matrix W. Vinga et al. (2004) found their results virtually the 
same for different scoring matrices (BLOSUM62, BLOSUM50, BLOSUM40 and PAM250); we 
use BLOSUM62 (Henikoff & Henikoff, 1992). 

Otu and Sayood (2003) showed how Lempel-Ziv complexity, computed in a simple fash- 
ion utilizing two elementary operations (Lempel & Ziv, 1976), can be used to define distance 
measures. We examine their final measure (that they call dj*): c(X) denotes the Lempel-Ziv 
complexity of X, and XY refers to the concatenation of X and Y . 

d LZ (X Y) = C ( XY )-< X )+ C ( YX )- C ( Y ) (U) 
{ ' ' \[c{XY)+c{YX)] 

Most recently, Burstein et al. (2005) proposed the Average Common Substring (ACS) ap- 
proach. They define L(X, Y) = Y!i=\ Y / n > where if Y is the length of the longest string starting 
at Xi that exactly matches a string starting at Yj. L provides a normalized length measure, from 
which we obtain an intermediate (asymmetric) distance d and finally d ACS . 

v ' ' L{XJ) L(X,X) v 
d ACS (X, Y)= X - [d(X, Y) + d(Y,X)] (14) 



3 Pattern-based approach 

We use pattern discovery to find regions of similarity (presumed homology) occurring in two or 
more sequences; no alignment is necessary. To estimate phylogenetic distances, the patterns are 
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considered to be local alignments. Adopting this point of view enables us to apply an established 
maximum likelihood (ML) approach. Both the application of pattern discovery, and distance 
estimation by ML, represent novel steps in this context. We infer pairwise distances from the 
sequence data covered by patterns, yielding a distance matrix. Distance-based tree inference 
then proceeds by conventional means. 

3.1 Terminology 

We briefly introduce some basic terminology for TEIRESIAS; for more details, including a 
description of the algorithm, the reader is referred to (Rigoutsos & Floratos, 1998). Let A de- 
note the alphabet of characters, e.g. all amino acids. Let '.' ^ A be the wildcard character that 
represents any amino acid. Define a pattern P to be the regular expression A(A U {'.'})* A. A 
subpattern of P is any substring that is a pattern. Call P a < L, W > pattern (L < W) if any subpat- 
tern of length > W has > L characters e A. A pattern P has support K if it occurs (has instances) 
in K sequences. A pattern P can be made more specific by replacing wildcard characters by char- 
acters G A and/or extending P to the left or right. Call P maximal with respect to a sequence set 
S if making P more specific reduces its total support (irrespective of the number of sequences). 

We are now ready to state the behaviour of the algorithm: TEIRESIAS finds all maximal 
< L, W > patterns (with support > K) in a set S of unaligned sequences. 

3.2 Distance calculation 

The pairwise distance d PB between two sequences S; and Sj is computed as follows: first, we 
filter out patterns that occur more than once in any sequence. This removes false positives and 
ensures that the self-distance of any sequence is zero. Second, all instances of patterns occurring 
simultaneously in (at least) Si and Sj are concatenated, resulting in two new sequences S\j and 
S'ji of the same length. Note that a pattern may occur in three or more sequences, in which case 
we project it on multiple pairs of sequences. Also note that generally • (and S^) will differ 
in length from SL (and SL) because the patterns occurring simultaneously in (at least) 5, and Sj 
will differ in number (and number of residues they cover) from those appearing in St and Sj. 
Third, these new, concatenated sequences are used to estimate pairwise distances. This is done 
by applying a maximum likelihood-based approach that optimizes with respect to a model of 
amino acid evolution. For the purpose of this work we use the JTT model (Jones et al., 1992) as 
implemented in PROTDIST from the PHYLIP package (Felsenstein, 2005). 

Note that the algorithm for calculating distances from patterns is general. Our means for 
pattern discovery is TEIRESIAS, but, in principle at least, other tools could also be used (eg 
Apostolico et al., 2005). To retain the alignment-free property of our approach, any replacement 
needs to have that property as well. 
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3.3 Parametrization 



Our rationale for parameterizing TEIRESIAS with L = 4, W = 16 is as follows. Consider 
ordinary fc-mers: higher values for k reduce chance occurrences among a set of sequences, thus 
reducing false positives. We observe that TEIRESIAS patterns and £-mers bear a relationship; 
to this end we introduce elementary patterns: an elementary pattern is a < L, W > pattern with 
exactly L residues. TEIRESIAS discovers maximal < L, W > patterns using elementary patterns 
as building blocks during its convolution phase (Rigoutsos & Floratos, 1998). For the special 
case W = L (no wildcard characters are allowed), we find that setting L = k leads to elementary 
patterns capturing a subset of all fc-mers. The only difference is that K = 1 for fc-mers (a k- 
mer may occur only once) whereas we use K = 2 for TEIRESIAS (a pattern must occur in at 
least two sequences). Thus we see that higher values for L reduce the number of false positives. 
For our distance calculation, however, we need patterns capable of accounting for differences 
between sequences, hence we require W > L. In preliminary experiments on data described in 
Section 4.2, we tried several higher values for L with W > L first. We found for L = 4, W — 16 
(a ratio of L/W = 0.25), the values that we use throughout Section 4, all pairwise distances 
are defined, i.e. every pair of sequences is covered by at least one instance of a pattern. For 
W — 8, corresponding to a ratio of L/W = 0.5, and higher values of W, approaching the ratio 
L/W = 0.25, the number of undefined distances is 229, 127, 63, 32, 23, 8, 5, and 2 out of 8667. 
(On data from Section 4.1, all distances are defined for W = 8.) 

Undefined distances point towards a problem: some sequence pairs are too divergent — no 
pair of substrings can be described by (elementary) <L, W > patterns. The ratio L/W determines 
the minimum similarity any subpattern must possess: it effectively specifies a local similarity 
threshold. Thus, undefined distances mean that no pair of substrings reach or exceed this thresh- 
old. Our solution to the problem is to make sequences more similar by encoding them in a 
reduced alphabet. Following Rigoutsos et al. (2000) and earlier work by Taylor (1986), we 
choose a reduction based on chemical equivalences: [AG], [DE], [FY], [KR], [ILMV], [QN], 
[ST], [BZX] where '[...]' groups similar amino acids together, and unlisted amino acids form 
classes of their own. The phylogenetic distance calculation is based on the original sequence data 
covered by the resulting patterns; this usually improves phylogenetic accuracy (see Section 4.1.1 
and 4.2.2). As a result of encoding sequences, all pairwise distances for e.g. L = 4, W = 8 are 
defined. 

We also find that for sufficiently small values of L, the phylogenetic accuracy is virtually in- 
dependent of the particular choice of L, and largely depends on the ratio L/W (data not shown). 
Generally, the accuracy of tree reconstruction improves as the local similarity threshold is low- 
ered, with diminishing improvement and higher computational costs the further it is lowered. 

3.4 Majority consensus and consistency 

One property of TEIRESIAS is that each residue can (and given our parametrization, usually 
will) participate in multiple patterns. This may lead to situations where a particular residue in 
one sequence pairs with two or more different residues in a second sequence. It is not clear how 
this should be interpreted with respect to homology. We propose a variant, d PBMC that resolves 
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this conflict by way of (relative) majority consensus and consistency. We discover patterns as 
before but introduce an intermediate step before distance estimation. We record paired positions 
across all patterns. For any two sequences 5, and Sj, we take positions (pi,Pj) if and only if a) pi 
is paired with p'- more often than with any other position in Sj, b) pj is paired with p ■ more often 
than with any other position in 5„ and c) pi = p\ and pj = p'j, i.e. the positions are the same. 
This ensures that every residue participates at most once for a given sequence pair in the distance 
calculation step. For parameters L = 4, W = 16, the constraints prove to be stringent and discard 
most of the data. 

4 Comparison of alignment-free methods 
4.1 Synthetic data 

We use a birth-death process to model cladogenesis (Nee et al., 1994) and sample from several 
tree distributions. The effects of different taxon sampling strategies are described in (Rannala 
et al., 1998). Trees resulting from a birth-death process are rooted, bifurcating and ultramet- 
ric; we deviate them from ultrametricity by an additive process to keep the expectation of the 
phylogenetic distances unchanged. 

Using PhyloGen VI. 1 (Rambaut, 2002) we sampled seven sets of 100 four-taxon ref- 
erence trees each; the parameters were birth = 10.0 and death = 5.0, with extant G 
[40, 133, ... , 40000] corresponding to a sample fraction of [0. 1 , 0.03, . . . , 0.0001] . The induced 
pairwise phylogenetic reference distances have medians of [0.7 1,1.11,1.61, 2.08, 2.46, 2.96, 3.39] 
substitutions per site; their upper and lower quartiles are within 0.35 units of these values. For 
later use, we label the first, fourth and last set as having small, medium and large phylogenetic 
distances. Sequences were evolved along the branches of the deviated trees using SEQ-GEN 
(Rambaut & Grassly, 1997) VI. 3. 2 under the JTT model (Jones et al., 1992), and for a sequence 
length of 1000 amino acids. (Where possible, we parameterized alignment-free methods with 
the JTT model, or its equilibrium frequencies.) 

To compare alignment-free methods with alignment-based methods when the assumption of 
collinearity is violated, we constructed an additional dataset with a wide distribution of phyloge- 
netic distances. We sampled one four-taxon tree each from 100 different distributions specified 
by sample fractions that varied evenly on a logarithmic scale. The induced pairwise phylogenetic 
reference distances have a median of 1 .77 substitutions per site; the upper and lower quartiles are 
2.54 and 1.02, respectively, and the maximum is 4.88. Sequences of length 1000 were evolved 
as before, and for every sequence set the first and last halves of two sequences were exchanged. 
This corresponds to a recent domain shuffle event. We deliberately chose an extreme example to 
show the severity that a non-justified assumption of collinearity can have. 

The generated sequences were input to the tested alignment-free methods, and the resulting 
test distances were used to infer neighbor-joining (NJ: Saitou & Nei, 1987) trees. Phylogenetic 
accuracy is measured by the Robinson-Foulds (RF: Robinson & Foulds, 1981) tree metric. We 
compute the topological difference between a test tree and its corresponding (unrooted) reference 
tree, and report results for each set. To assess the statistical significance of differences between 
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methods we employ the Friedman test (corrected for tied ranks), followed by Tukey-style posthoc 
comparisons if a significant difference is found (see e.g. Siegel & Castellan, 1988). 

4.1.1 Phylogenetic accuracy 

Here, and in Section 4.2.2 we are interested in the accuracy of methods in reconstructiong the 
phylogentic relationships among a set of sequences; we refer to this quantity as phylogenetic 
accuracy for short. We measure and report the topological differences between test and reference 
trees: better methods yield fewer differences, and hence have a higher accuracy. When we assess 
methods based on their ranksums of the Friedman test, better methods obtain lower numbers and 
rank first. 

Influence of k and alphabet Here, we look at the performance of word-based alignment-free 
methods as a function of the length of fc-mers and the alphabet in use. We varied k from 1 to 9 
where possible: the composition distance requires a minimum of k = 3 . The alphabet consisted of 
either the original amino acids (A A) or the chemical equivalence classes (CE) from Section 3.3. 

For AA sequences, word length k = 4 performs best for methods d E , d s , d F and d p as judged 
by their ranksums based on phylogenetic accuracy over all seven reference sets. Second- and 
third-ranking word lengths for d E and d p are k = 5 and k = 3. For d F these lengths have tied 
ranks, and for d s this order is reversed. Method d c performs best for k = 3, with k = 4 (k = 5) 
ranking second (third). 

For CE sequences, slightly higher values for k yield lower ranksums. Methods d E , d F and d p 
perform best with word length k = 5. Second- and third-ranking word lengths for d E and d p are 
k = 6 and k = l, for d E this order is reversed. For d s , word lengths k = 5 and k = 6 rank equal 
best, followed by k = 4. Again, method d c shows a preference for lower values: it performs 
jointly best for k = 4 and k = 5, followed by k = 3. 

What we have described so far is based on the ranksums over all seven reference sets span- 
ning the relevant space of phylogenetic distances for tree inference. Looking at the phylogenetic 
accuracy of word-based methods on individual sets with narrow distributions of phylogenetic dis- 
tances reveals a more complex picture. As expected, the topological difference between test and 
reference trees increases with increasing phylogenetic reference distances. However, depending 
on the choice of k, the absolute values of this difference may vary considerably. This leads to 
a number of curves with different shapes when plotting the accuracy for a particular method on 
the the Y-axis with the X-axis showing values for k (Figure 1). We find that overall, the choice of 
method has less impact on the shape of these curves than does phylogenetic distance. Comparing 
AA with CE sequences shows similarly shaped curves that are shifted to the right for CE. 

Figure 1 (a,c,e) shows curves for method d E for three out of seven different reference sets 
with small, medium and large phylogenetic distances. The curves for methods d s , d F and d p are 
similar and omitted here. The curve for medium distances corresponds to our overall findings. 
Small and large distances hint at a better performance for small values of k. Inspection of these 
plots for the remaining methods fails to identify a single best k. Figure 1 (b,d,f) for d c reveals 
some striking peculiarities of this method. In these three sets, the topological difference for k = 6 
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is often the highest, even though the neighbor value k = 5 may yield a low topological difference 
(medium distances): thus the parameter space is uneven, more so than for other methods. 

Taken together, these results indicate that, depending on the phylogenetic distance of the 
sequences, the word length most appropriate for analysis of a particular dataset may well be 
different from the one performing best over all sets tested here. 

Statistical significance Here, we conduct a comprehensive comparison of all methods: we 
show their phylogenetic accuracy and assign statistical significance to our findings. For fc-mer- 
based methods, we select the best-performing word lengths on the two alphabets as described 
above. We compare these methods to d ACS , d LZ and d w , and to two variants of the pattern- 
based approach: d PB and d PBMC . As a baseline, we include d ML , the maximum likelihood (ML) 
estimate of phylogenetic distances between the (already correctly aligned) sequences. 

Table 1 lists selected methods in ranksum order based on all 700 trees, from best to worst. 
For every method, we show the number of incorrectly reconstructed trees in each of the seven 
sets. Recall that unrooted, bifurcating four-taxon trees can be reconstructed either correctly or 
incorrectly: the RF distance will be or 1, with no intermediate values possible. 

The test statistic of the Friedman test (corrected for tied ranks) is Fr = 709.6 (N = 700, 
k = 19). This is highly significant (P < 10~ 10 ) beyond the a = 0.05 level. Significant differences 
are found between the following pairs (numbers refer to column '#' of Table 1): method 1 vs 
methods 19-4, method 2 vs methods 19-5, methods 3 and 4 vs methods 19-15, and methods 
5-16 vs method 19. Thus the performance of most alignment-free methods as tested here is 
statistically indistinguishable from one another. The ranksums of methods 5-14 range from 
6951.0 to 7074.5, differing by < 123.5. However, the pattern-based method d PB with CE, L = 4, 
W = 16 (ranksum: 6058.0) is significantly better than all alignment-free methods not based 
on patterns. The ML estimate based on the correct alignment, d ML , is significantly better than 
all traditional alignment-free methods and the pattern-based method working on original AA 
sequences. Our tests show that d ML is not significantly better than the two best-performing 
pattern-based variants working on CE sequences. 

By far the worst method tested here is the W-metric d w (ranksum: 8281.0): differences to 
nearly all other methods are significant. The lack of phylogenetic accuracy originates from being 
based on l-mers. For comparison, d E with AA, k = 1 incorrectly reconstructs the following 
number of trees for the seven reference sets: 38, 30, 39, 53, 58, 66, 59. These numbers are quite 
similar to the ones in Table 1, as are the numbers for equally parameterized methods d s and d F . 
In the case of d p , however, they are 59, 56, 65, 75, 59, 71, 65. This is an artifact of the method 
for k = 1 (and to some extent for k = 2) and vanishes for higher values. Also apparent is the poor 
performance of both parametrizations of d LZ and d c , the Lempel-Ziv and composition distances, 
respectively, with ranksums between 7359.5 and 7635.0. 

Domain shuffling We now describe our findings from the reference set with simulated domain 
shuffling data. We apply the same alignment-free methods with parameter settings as before on 
the unaligned, partly shuffled sequences. Additionally, we run a number of multiple sequence 
alignment (MSA) programs on this data (Thompson et al., 1994; Morgenstern, 1999; Edgar, 
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Figure 1: Average Robinson-Foulds distance (Y-axis) for two methods (a,c,e: d E ; b,d,f: d c ) 
on three reference sets (top to bottom: small, medium and large phylogenetic distances). Each 
subfigure shows the behaviour as a function of k (X-axis) under two alphabets (AA: original 
amino acids, CE: chemical equivalence classes). Points are joined for ease of visual inspection 
only. 
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55 


14 


7074.5 


d F 


CE 


5 


6 


6 


21 


31 


47 


49 


62 


15 


7359.5 


d LZ 


CE 




6 


5 


28 


39 


49 


66 


59 


16 


7378.5 


d LZ 


AA 




3 


8 


21 


34 


53 


64 


71 


17 


7597.0 


d c 


AA 


3 


12 


14 


31 


47 


49 


58 


66 


18 


7635.0 


d c 


CE 


5 


15 


14 


32 


45 


54 


63 


58 


19 


8281.0 


d w 


AA 


(1) 


43 


34 


35 


55 


50 


71 


61 



Table 1: Shown are the number of incorrectly reconstructed trees (out of 100) for each synthetic 
reference set and method. The order is based on ranksums; for each word-based method (and 
alphabet A), we include the best-performing k (method d w can only take on a value of 1). 
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2004b; Do et al., 2005), and estimate ML distances from these alignments (jClustalw^ jDialign^ 
jMUSCLE an( j jProbCons^ reS p ec tively). This corresponds to an undesirable situation where e.g. 
in an automated environment tests have failed to detect the presence of domain shuffling. Hence, 
distances are estimated from alignments where not all homologous residues can possibly be 
aligned, and it is likely that in fact a substantial fraction of non-homologous residues have been 
aligned. 

The Friedman test (Fr = 270.3, N = 100, k = 22) detects the presence of a difference that 
is highly significant (P < 10~ 10 ) beyond the a = 0.05 level. However, pairwise differences are 
statistically significant only between j ClustalW and all other methods; this is likely due to lack 
of statistical power. Two parametrizations (CE and AA) of the pattern-based method d PB , L = 4, 
W — 16, rank jointly first with ranksums of 1011.5: they reconstruct 11 out of 100 trees incor- 
rectly. This is followed jointly by d PBMC , with CE, L = 4, W = 16, and d s and d F , both with 
CE, k = 5. Their ranksums are 1066.5, and they reconstruct 16 trees incorrectly. The num- 
bers for alignment-based approaches are as follows (ranksum in parentheses): j ClustalW : 71 
(1671.5), jMUSCLE. 28 (1198.5), J ProbCons : 25 (1165.5), and J Dialign : 21 (1121.5). Three out 
of four alignment-based approaches are among the seven worst-ranking methods. Interestingly 
jDialign^ ^ best-performing of these approaches, uses a local alignment strategy; it occupies 
rank eleven jointly with two other methods. Conversely, what we just described means that e.g. 
d LZ , working on AA sequences, one of the worst-performing alignment-free methods as tested 
here, has a higher phylogenetic accuracy than three out of four combinations of MSA program 
and ML estimate, and even d w is significantly better than j ClustalW on this data. Overall, the 
results show that alignment-free methods may perform better than alignment-based approaches, 
especially on non-collinear sequence data, as alignment-free methods do not make assumptions 
of collinearity. 



4.2 Empirical data 

We use the data from version 2 of the original BAliBASE sets (Thompson et al., 1999). They 
consist of 141 manually curated benchmark alignments that are organized in five reference sets. 
Their purpose is to support tests of alignment tools under a variety of conditions: Set 1 is made up 
of roughly equi-distant sequences that are divided into nine subsets according to their sequence 
conservation and alignment length. Set 2 contains sequence families that are aligned with a 
highly divergent orphan sequence. Set 3 aligns subgroups with less than 25 percent identity be- 
tween them. Set 4 consists of sequences with N- or C-terminal extensions, i.e. the sequences are 
not trimmed at alignment boundaries. Set 5 is complementary to set 4: some sequences contain 
internal insertions. Two alignments contain only three sequences each and are not considered for 
evaluation of phylogenetic accuracy, as there is only one corresponding unrooted tree topology. 
The remaining 139 alignments consist of between 4 and 28 sequences each. 

For each reference alignment, we estimate phylogenetic reference distances using Prot- 
dist, and reconstruct both neighbor-joining (NJ) and Fitch-Margoliash (FM: Fitch & Margo- 
liash, 1967) reference trees. The topological difference between a test tree and its corresponding 
reference tree is measured by the Robinson-Foulds (RF) and the Quartet (Q) distance (Estabrook 
et al., 1985; implemented in QDlST: Mailund & Pedersen, 2004). As these are empirical data, 
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we cannot know the true tree along which the sequences evolved; however, we find that by using 
a large number of trees, and four combinations of tree reconstruction method and tree topol- 
ogy metric (RF-NJ, RF-FM, Q-NJ, Q-FM), we are able to rank methods robustly. Statistical 
significance is assessed as in Section 4.1. 

4.2.1 Phylogenetic distances 

Here, we inspect the behaviour of pairwise phylogenetic distances. The BAliBASE alignments 
yield 8667 reference distances, of which < 2.5% have > 5.0 substitutions per site (36 distances 
are > 10.0). In what follows, we consider reference distances < 5.0, i.e. < 500 PAMs. Note 
that "distances of 250-300 PAM units are commonly considered as the maximum for reasonable 
distance estimation" (Sonnhammer & Hollich, 2005). Figure 2 contains scatterplots where the 
X-axis refers to the afore-mentioned reference distances. The Y-axis shows the corresponding 
phylogenetic distances obtained using selected alignment-free methods with original AA se- 
quences (and, if the methods are word-based, values for k as in Table 1). Additionally, we show 
distances obtained from CE sequences where the distribution differs noticably. 

Figure 2a for d PB with CE sequences, L = 4, W = 16, shows a linear relationship between 
reference and estimated distances for up to about 2.5 substitutions per site. Linear regression 
for all points below this cutoff yields y = 0.2598 + 1.1092.x: with a correlation coefficient (CC) 
of 0.8188. Higher distances are increasingly underestimated as saturation comes into effect and 
limits most distances (> 99%) to values < 3.5. 

When patterns are discovered using the original BAliBASE sequences (Figure 2b) as opposed 
to using CE, the resulting distances are approximated by a line with a lesser slope (y = 0.1373 + 
0.7160.x;, CC= 0.8261); saturation limits most distances (> 99%) to values < 2.5. For the variant 
d PBMC with CE sequences, L = 4, W = 16 (Figure 2c), the linear relationship extends at least 
up to 2.5 substitions per site and the slope is roughly half that of the equally parameterized d PB 
(y = 0.2535 + 0.58 19.x, CC= 0.8564). Again, most distances (> 99%) are limited to values 
< 3.5, however this curve shows less saturation and more scatter. 

The (squared) Euclidean distance d E does not yield values that can be interpreted in units of 
substitutions per site. Instead, they relate to mismatch counts (Blaisdell, 1989) and are therefore 
sequence length-dependent. Correspondingly, the distances do not show a single discernable 
linear relationship (Figure 2d for k = 4, with AA sequences). Most of the data (> 99%) have an 
Euclidean distance of < 1200; data for k = 5, with CE sequences are very similar and omitted 
here. 

Similarly, d s has no single discernable linear relationship, with most data (> 99%) taking on 
numerical values of < 200 (k = 4, AA, Figure 2e). However, parameters k = 5, CE yield distances 
with most values (> 99%) being < 4000, although the scatterplots look almost identical. 

We find a linear relationship between reference and d F distances for up to about 1 .0 sub- 
stitutions per site (Figure 2f). Linear regression for all points below this cutoff yields y = 
0.3681 + 1.5088*, CC= 0.8626. We find 25.5% of all pairwise distances are -log(e), i.e. would 
be undefined without adding £ as no fc-mer is common to a sequence pair. This problem occurs 
especially for short, divergent sequences in set 1 of BAliBASE. For k = 5, with CE sequences, 
this number drops down considerably to 15.1%, and the plot exhibits more scatter. 
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Figure 2: Pairwise phylogenetic reference distances (X-axis) plotted against corresponding cal- 
culated distances (Y-axis). Methods and parameters are as follows: a) d PB with L = 4, W = 16, 
CE, b) d PB with L = 4, W = 16, AA, c) d PBMC with L = 4, W = 16, CE, d) d E with k = 4, AA, 
e) d s with = 4, AA, f) J F with = 4, AA (cont'd). 
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Figure 2: (cont'd) g) d p with k = 4, AA, h) d c with k = 3, AA, i) d c with k = 5, CE, j) d w , k) 
d iZ with AA, 1) d ACS with AA. 
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A similar problem occurs for d p (Figure 2g), which is also based on common fc-mers. Here 
30.8% of all pairwise distances yield the value 1.0; this higher percentage is possibly caused by 
numerical instabilities when multiplying small probabilities. As before, parametrization with 
k = 5, CE reduces the percentage considerably, to 21.0%, and increases scatter. 

The scatterplots for the composition distance d c differ between parametrization k = 3, AA 
and k = 5, CE (Figure 2h,i). The former version shows a linear behaviour for up to about 1.0 sub- 
stitutions per site (y = 0.2946 + 0.1906*, CC= 0.7881), with no value > 0.55, whereas the latter 
version shows no such large linear behaviour and is limited to about 0.75. More importantly, 
in this latter version 13.1% of its distances are exactly 0.5 (within the usual precision limits im- 
posed by implementations of floating point numbers), and 30.2% are G [0.4999,0.5001]. For the 
former version, just 6 distances amount to 0.5, and 60 are G [0.4999,0.5001]. 

We find no discernable relationship between reference distances > 0.5 and distances pro- 
duced by the W- metric (Figure 2j). This likely explains why it turns out to be the worst of all 
tested methods here. The distances are mostly (> 99%) limited to values < 0.3. 

The two distributions of values for parametrizations of the Lempel-Ziv distance d LZ with 
both AA (Figure 2k) and CE sequences are very similar, with CE showing more scatter. 

The method d ACS shows a linear relationship between reference and calculated distances for 
up to about 1.0-1.5 substitutions per site (Figure 21). Linear regression for all points below 1.5 
yields y = 0.5618 + 0.5631* (CC= 0.8643). We find most distances (> 99%) are limited to 
values < 1.65. For CE sequences, most distances (> 99%) are limited to values < 1.1. 

Comparing the various distributions, we find that all three versions of our pattern-based ap- 
proach yield pairwise distances that exhibit a linear relationship to phylogenetic reference dis- 
tances for up to about 2.5 substitutions per site. This constitutes a considerable increase from a 
maximum of 1.0-1.5 substitutions per site for methods d ACS , d F and d c . The linear relationship 
is a desirable property, and likely explains the higher phylogenetic accuracy of the pattern-based 
approach. 

4.2.2 Phylogenetic accuracy 

Statistical significance Table 2 lists selected alignment-free methods and four approaches 
based on the ML distance estimate from automated alignments. To obtain combined ranksums 
for statistical analysis, we average the normalized topological differences over all four combi- 
nations of tree reconstruction method and tree distance measure (RF-NJ, RF-FM, Q-NJ, Q-FM) 
based on 139 sequence sets. The combined ranksums range from 975.5 to 2338.0. This is a 
slightly wider range than for each individual combination, for which the extreme values are 
G [987.5, 1058.5] and G [2280.0,2312.0]. The first and last six methods are ranked identically 
between the combined analysis and combination RF-NJ. The average normalized topological 
differences for this combination are shown in Table 2 for all five BAliBASE reference sets. 

The Friedman test statistic F R = 579.2 (N = 139, k = 22) is highly significant (P < 10 10 ) 
beyond the a = 0.05 level. Significant differences are found between the following pairs (num- 
bers refer to column '#' of Table 2): methods 1-3 vs methods 22-6, method 4 vs methods 22-9, 
method 5 vs methods 22-12, methods 6 vs methods 22-20, methods 7-18 vs methods 22 and 
21, and method 19 vs method 22. This implies that all alignment-based approaches yield signif- 
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BAliBASE reference set 



# 


Ranks um 


Method 


/I 

Ji 


k 


1 
1 


I 


5 


4 


c 

J 


1 


975.5 


jMUSCLE 


A A 

AA 




A O /I A 

0.240 


A OTA 

0.370 


A T7 y1 

0.274 


A A A O 

0.442 


A O A A 

0.244 


2 


1 AAC (\ 

1005.0 


^ClustalW 


AA 




A O 1 A 

0.210 


A O O A 

0.389 


a on 

0.337 


0.423 


A O /I A 

0.249 


3 


1 /"\ f\C\ / \ 

1 008.0 


^ProbCons 


A A 

AA 




0.204 


0.396 


0.336 


0.414 


0.164 


A 

4 


1 1 f\f\ c 

1 190.5 


^DlALIGN 


A A 

AA 




0.310 


0.428 


0.399 


0.646 


0.270 


5 


1239.0 


d rD 


CE 




0.306 


0.510 


0.357 


0.478 


0.330 


6 


1 /I CO A 

1453.0 


aPB 

a 


A A 

AA 




A A A A 

0.404 


A C/^O 

0.563 


A 1 AO 

0.398 


A CO /I 

0.524 


0.428 


7 


1 CIA A 

1570.0 


jPBMC 

a 


Cb 




A A A A 

0.440 


A ccn 

0.557 


A /I O O 

0.428 


A /^AA 

0.609 


A /I A 

0.460 


o 



1 COO A 

1583.0 


a 


Cb 




0.394 


A COI 

0.583 


0.433 


A C A 1 

0.591 


0.366 


9 


1603.0 


d p 


CE 


5 


0.408 


0.570 


0.442 


0.568 


0.412 


10 


1625.5 


d LZ 


AA 




0.431 


0.569 


0.389 


0.642 


0.511 


11 


1632.5 


d E 


CE 


5 


0.408 


0.593 


0.464 


0.570 


0.410 


12 


1646.0 


d F 


CE 


5 


0.396 


0.575 


0.467 


0.569 


0.400 


13 


1703.0 


d ACS 


AA 




0.483 


0.579 


0.401 


0.660 


0.451 


14 


1705.0 


d E 


AA 


4 


0.508 


0.578 


0.418 


0.622 


0.489 


15 


1706.5 


d LZ 


CE 




0.421 


0.622 


0.440 


0.628 


0.437 


16 


1707.5 


d p 


AA 


4 


0.496 


0.589 


0.419 


0.637 


0.469 


17 


1751.5 


d F 


AA 


4 


0.515 


0.580 


0.431 


0.666 


0.475 


18 


1755.0 


d s 


CE 


5 


0.446 


0.636 


0.491 


0.603 


0.375 


19 


1830.0 


d s 


AA 


4 


0.513 


0.624 


0.450 


0.607 


0.528 


20 


1968.5 


d c 


AA 


3 


0.481 


0.681 


0.525 


0.570 


0.588 


21 


2171.0 


d c 


CE 


5 


0.535 


0.776 


0.642 


0.796 


0.611 


22 


2338.0 


d w 


AA 


(1) 


0.585 


0.885 


0.795 


0.897 


0.720 



Table 2: Shown are average topological differences for each BAliBASE reference set and 
method; these average values are based on neighbor-joining trees and the normalized Robinson- 
Foulds measure. The order is based on combined ranksums (see text for details); parameters for 
word-based methods are as in Table 1 . 
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icantly better results than any of the alignment-free methods not based on patterns, except for 
jDialign vs jAC5 w ifa CE. Additionally, three out of four alignment-based approaches (ranksums: 
975.5-1008.0) are significantly better-performing than two pattern-based variants although not 
than d PB with CE, L = 4, W = 16 (ranksum: 1239.0). This version significantly outperforms all 
but four alignment- free methods not based on patterns. Again, both parametrizations of the com- 
position distance and the W-metric trail behind, with ranksums of 1968.5, 2171.0 and 2338.0, 
respectively. Similarly to our previous analysis, most alignment-free methods are statistically in- 
distinguishable. Ranksums for methods 8-18 range from 1583.0 to 1755.0, a difference of 172.0. 
On this dataset, d PBMC is only marginally better (ranksum: 1570.0). A possible explanation is 
apparent from Table 1 . There, d PBMC performs poorly on reference set 7 (large phylogenetic 
distances) in comparison to both parametrizations of d PB . We find 1986 out of 8667 pairwise 
phylogenetic reference distances (i.e. 22.9%) in BAliBASE have > 3.0 substitutions per site. 

5 Conclusions 

We present here for the first time a comprehensive evaluation of alignment-free methods with 
respect to their accuracy in reconstructing the phylogenetic relationship among a set of se- 
quences. We show that the performance of most methods is statistically indistinguishable from 
another. The pattern-based approach as introduced by us here proved to be significantly better 
than most previously established methods. At the same time, we provide a point of reference for 
alignment-free methods by measuring the maximum likelihood (ML) distance estimate based on 
reference and automated alignments. In our tests, we found the best-performing version of our 
pattern-based approach d PB to be statistically indistinguishable from this estimation, while most 
alignment-free methods rank significantly worse on ordinary, non- shuffled sequences. However, 
on non-collinear sequences we show that most alignment-free methods reconstruct trees more ac- 
curately than approaches based on automated alignments. In fact, these alignments should not be 
used as they largely align non-homologous residues. The inspection of ClustalW alignments 
reveals artifacts of this method: it forces most residues to align with other (non-homologous) 
residues, and places too few gaps. 

In all three experiments we found that d PB ranks higher than the equally parameterized variant 
d PBMC , although not significantly. The latter variant intuitively seems to do more justice to 
the concept of homology; however, we cannot provide a satisfying explanation for its worse 
perfomance. All three versions of our pattern-based approach result in distances that show a 
linear relationship to phylogenetic reference distances over a substantially longer range than any 
other alignment-free method considered here. 

We also utilized a different alphabet for amino acid (AA) sequences based on chemical equiv- 
alences (CE). We found that d PB with CE yields results as good as d PB with AA, and often yields 
considerably increased phylogenetic accuracy. We also tested the other alignment-free methods 
on sequences encoded in this alphabet. For any given parametrization, CE always improves per- 
formance on set 7 (large phylogenetic distances: cf Table 1, and also Figure 1 e,f) by 4 to 12 
(out of 100) less incorrectly reconstructed trees. This probably explains why methods parame- 
terized with CE vs AA perform better on BAliBASE than on the synthetic dataset. Note that we 
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did not try to optimize the alphabet; certainly, there are many different choices (see eg Edgar, 
2004a). Also, our findings seem to contradict results of that study. Edgar (2004a) found that k- 
mers based on various compressed alphabets did not improve the correlation coefficient between 
d F and percent identity as compared to using the original alphabet. In our own experiments we 
found the correlation coefficient between estimated and reference distance to be a bad estimator 
of phylogenetic accuracy (data not shown). 

Finally, based on the data in Table 1, we note that there is ample room for further improve- 
ment of alignment-free methods: compare the results for d ML with d PB , especially on reference 
sets 5 to 7, i.e. large phylogenetic distances. Quite likely this will be possible only if new 
alignment-free methods incorporate models of sequence change. 

Acknowledgement 

M.H. received a Graduate Student Research Travel Award from the University of Queensland. 
ARC grant CE0348221 funded part of the research. 

References 

Apostolico, A., Comin, M., & Parida, L. (2005). Conservative extraction of over-represented ex- 
tensible motifs. In Proceedings of the 13th International Conference on Intelligent Systems 
for Molecular Biology (ISMB 2005) (pp. 223-233). 

Blaisdell, B. (1986). A measure of the similarity of sets of sequences not requiring sequence 
alignment. Proc. Natl Acad. Sci. U.S.A., 83(14), 5155-5159. 

Blaisdell, B. (1989). Average values of a dissimilarity measure not requiring sequence alignment 
are twice the averages of conventional mismatch counts requiring sequence alignment for 
a computer-generated model system. J. Mol. Evol, 29(6), 538-547. 

Burstein, D., Ulitsky, I., Tuller, T, & Chor, B. (2005). Information theoretic approaches to whole 
genome phylogenies. In Proceedings of the Ninth Annual International Conference on Re- 
search in Computational Molecular Biology (RECOMB 2005) (pp. 283-295). Cambridge, 
MA. 

Do, C, Mahabhashyam, M., Brudno, M., & Batzoglou, S. (2005). ProbCons: Probabilistic 
consistency-based multiple sequence alignment. Genome Res., 15(2), 330-340. 

Edgar, R. (2004a). Local homology recognition and distance measures in linear time using 
compressed amino acid alphabets. Bioinformatics, 32(1), 380-385. 

Edgar, R. (2004b). MUSCLE: multiple sequence alignment with high accuracy and high through- 
put. Nucleic Acids Res., 32(5), 1792-1797. 

Estabrook, G., McMorris, E, & Meacham, C. (1985). Comparison of undirected phylogenetic 
trees based on subtrees of four evolutionary units. Syst. Zool, 34(2), 193-200. 

Felsenstein, J. (2005). PHYLIP (phylogeny inference package) version 3.65. Distributed by the 
author. (Department of Genome Sciences, University of Washington, Seattle) 

Fitch, W., & Margoliash, E. (1967). Construction of phylogenetic trees. Science, 155, 279-284. 



19 



Gentleman, J., & Mullin, R. (1989). The distribution of the frequency of occurrence of nucleotide 

subsequences, based on their overlap capability. Biometrics, 45(1), 35-52. 
Hao, B., & Qi, J. (2004). Prokaryote phylogeny without sequence alignment: from avoidance 

signature to composition distance. J. Bioinf. and Computat. Biol, 2(1), 1-19. 
Henikoff, S., & Henikoff, J. (1992). Amino acid substitution matrices from protein blocks. Proc. 

Natl Acad. Sci. U.S.A., 89(22), 10915-10919. 
Jones, D., Taylor, W., & Thornton, J. (1992). The rapid generation of mutation data matrices 

from protein sequences. Comput. Appl. Biosci., 8(3), 275-282. 
Lempel, A., & Ziv, J. (1976). On the complexity of finite sequences. IEEE Trans. Inform. 

Theory, IT-22, 75-81. 

Mailund, T, & Pedersen, C. (2004). Qdist — quartet distance between evolutionary trees. Bioin- 

formatics, 20(10), 1636-1637. 
Morgenstern, B. (1999). DIALIGN2: Improvement of the segment-to- segment approach to 

multiple sequence alignment. Bioinf ormatics, 15(3), 211-218. 
Nee, S., May, R., & Harvey, P. (1994). The reconstructed evolutionary process. Phil. Trans. R. 

Soc. B, 344(1309), 305-311. 
Otu, H., & Sayood, K. (2003). A new sequence distance measure for phylogenetic tree recon- 
struction. Bioinf ormatics, 79(16), 2122-2130. 
Pham, T, & Zuegg, J. (2004). A probabilistic measure for alignment-free sequence comparison. 

Bioinf ormatics, 20(18), 3455-3461. 
Rambaut, A. (2002). PhyloGen: Phylogenetic tree simulator package. (Available from http : 

/ / evolve . zoo . ox . ac . uk/ sof t ware /PhyloGen /ma in . html) 
Rambaut, A., & Grassly, N. (1997). Sequence-Generator: An application for the monte carlo 

simulation of molecular sequence evolution along phylogenetic trees. Comput. Appl. 

Biosci., 13, 235-238. 

Rannala, B., Huelsenbeck, J., Yang, Y., & Nielsen, R. (1998). Taxon sampling and the accuracy 

of large phylogenies. Syst. Biol., 47(A), 702-710. 
Rigoutsos, I., & Floratos, A. (1998). Combinatorial pattern discovery in biological sequences: 

the TEIRESIAS algorithm. Bioinf ormatics, 14(1), 55-67. (Published erratum appears in 

Bioinf ormatics, 14(2):229) 
Rigoutsos, I., Floratos, A., Parida, L., Gao, Y, Piatt, D., & Huynh, T. (2000, 10). TEIRESIAS: 

A vade mecum. IBM TJ Watson Research Center. (Available from http : / /cbcsrv . 

wat son . ibm . com/Tspd . html) 
Robinson, D. F., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Math. Biosci., 53, 

131-147. 

Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing 

phylogenetic trees. Mol. Biol. Evoi, 4(4), 406-425. 
Siegel, S., & Castellan, N., Jr. (1988). Nonparametric statistics for the behavioral sciences (2nd 

ed.). Boston, Massachusetts: McGraw-Hill. 
Snel, B., Huynen, M., & Dutilh, B. (2005). Genome trees and the nature of genome evolution. 

Annu. Rev. Microbiol, 59, 191-209. 
Sonnhammer, E., & Hollich, V. (2005). Scoredist: A simple and robust protein sequence distance 

estimator. BMC Bioinf ormatics, 6, 108. 



20 



Taylor, W. (1986). The classification of amino acid conservation. J. Theor. Biol, 119(2), 205- 
218. 

Thompson, J., Higgins, D., & Gibson, T. (1994). CLUSTALW: Improving the sensitivity of 
progressive multiple sequence alignment through sequence weighting, position specific 
gap penalties and weight matrix choice. Nucleic Acids Res., 22(22), 4673-4680. 

Thompson, J., Plewniak, R, & Poch, O. (1999). BAliBASE: a benchmark alignment database 
for the evaluation of multiple alignment programs. Bioinformatics, 15(1), 87-88. 

Van Helden, J. (2004). Metrics for comparing regulatory sequences on the basis of pattern 
counts. Bioinformatics, 20(3), 399-406. 

Vinga, S., & Almeida, J. (2003). Alignment-free sequence comparison — a review. Bioinformat- 
ics, 19(4), 513-523. 

Vinga, S., Gouveia-Oliveira, R., & Almeida, J. (2004). Comparative evaluation of word compo- 
sition distances for the recognition of SCOP relationships. Bioinformatics, 20(2), 206-215. 

Wu, T.-J., Burke, J., & Davison, D. (1997). A measure of DNA sequence dissimilarity based on 
the Mahalanobis distance between frequencies of words. Biometrics, 53(4), 1431-1439. 

Wu, T.-J., Huang, Y.-H., & Li, L.-A. (2005). Optimal word sizes for dissimilarity measures and 
estimation of the degree of dissimilarity between DNA sequences. Bioinformatics, 21(22), 
4125-4132. 



21 



